Volatility is one of the most prominent terms you’ll hear on any trading floor, and for good reason. In financial markets, volatility captures the amount of fluctuation in prices. High volatility is associated with periods of market turbulence and large price swings, while low volatility describes calmer, quieter markets. For trading firms like Optiver, accurately predicting volatility is essential for the trading of options, whose price is directly related to the volatility of the underlying product. In this exercise, we will explore the order book dataset provided by Optiver to predict future realised volatility. An order book lists the number of shares being bid on or offered at each price point. For simplicity, we only provide the highest two bid prices and the lowest two ask prices in the order book (which does not affect the computation of the realised volatility). Based on the order book, we need to be able to compute statistics like the weighted average price (WAP) and log returns. We will then calculate the realised volatility during any fixed time interval based on the log returns. Finally, we will use historical data to estimate future volatility.
library(ggplot2)
library(dplyr)
library(rugarch)
The data we look at today are standard csv files provided by Optiver. Each csv file contains order book snapshots of a specific stock at specific time points. With one-second resolution, it provides a uniquely fine-grained look at the micro-structure of modern financial markets. We will load the data for one stock (“stock_1.csv”) in this lab; the full data set of 127 stocks is available to students who select this project.
Each stock has consecutive order book snapshots (with one-second
resolution) for multiple 10-min time intervals. Each row of the csv file
is an order book snapshot of a stock at a specific time point. The
column time_id indicates the ID code for the 10-min time
bucket. Time IDs are not necessarily sequential but are consistent
across all stocks. Each row of the csv file also provides the highest
two bid prices, the lowest two ask prices, and their corresponding
sizes at the specific time point.
Suggestions
stock1 <- read.csv("data/individual_book_train/stock_1.csv")
dim(stock1)
## [1] 1507532 11
head(stock1)
## time_id seconds_in_bucket bid_price1 ask_price1 bid_price2 ask_price2
## 1 5 0 1.000754 1.001542 1.000689 1.001607
## 2 5 1 1.000754 1.001673 1.000689 1.001739
## 3 5 2 1.000754 1.001411 1.000623 1.001476
## 4 5 3 1.000754 1.001542 1.000689 1.001607
## 5 5 4 1.000754 1.001476 1.000623 1.001542
## 6 5 5 1.000754 1.001542 1.000623 1.001673
## bid_size1 ask_size1 bid_size2 ask_size2 stock_id
## 1 1 25 25 100 1
## 2 26 60 25 100 1
## 3 1 25 25 125 1
## 4 125 25 126 36 1
## 5 100 100 25 25 1
## 6 100 25 25 60 1
Compute the BidAskSpread and WAP for all the snapshots of stock 1.
Add the results as two new columns of stock1. Compute the
log return of stock 1 when time_id == 5 and
seconds_in_bucket == 16.
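For reference, these quantities can be written as follows (the formulas match the computations in the suggested code below):
\[
\text{WAP} = \frac{\text{BidPrice}_1 \cdot \text{AskSize}_1 + \text{AskPrice}_1 \cdot \text{BidSize}_1}{\text{BidSize}_1 + \text{AskSize}_1},
\qquad
\text{BidAskSpread} = \frac{\text{AskPrice}_1}{\text{BidPrice}_1} - 1,
\]
and the log return between two consecutive snapshots is \(r_t = \log(\text{WAP}_t / \text{WAP}_{t-1})\).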
Suggestions
stock1 <- stock1 %>% mutate(
WAP = (bid_price1 * ask_size1 + ask_price1 * bid_size1) / (bid_size1 + ask_size1))
stock1 <- stock1 %>% mutate(BidAskSpread = ask_price1 / bid_price1 - 1)
head(stock1)
## time_id seconds_in_bucket bid_price1 ask_price1 bid_price2 ask_price2
## 1 5 0 1.000754 1.001542 1.000689 1.001607
## 2 5 1 1.000754 1.001673 1.000689 1.001739
## 3 5 2 1.000754 1.001411 1.000623 1.001476
## 4 5 3 1.000754 1.001542 1.000689 1.001607
## 5 5 4 1.000754 1.001476 1.000623 1.001542
## 6 5 5 1.000754 1.001542 1.000623 1.001673
## bid_size1 ask_size1 bid_size2 ask_size2 stock_id WAP BidAskSpread
## 1 1 25 25 100 1 1.000785 0.0007868064
## 2 26 60 25 100 1 1.001032 0.0009179074
## 3 1 25 25 125 1 1.000780 0.0006556053
## 4 125 25 126 36 1 1.001411 0.0007868064
## 5 100 100 25 25 1 1.001115 0.0007212558
## 6 100 25 25 60 1 1.001384 0.0007868064
# To compute the log return, we should know the WAP at two consecutive times
WAP1 <- stock1 %>% filter(time_id == 5 & seconds_in_bucket == 16) %>% pull(WAP)
WAP2 <- stock1 %>% filter(time_id == 5 & seconds_in_bucket == 15) %>% pull(WAP)
log_r <- log(WAP1 / WAP2)
print(log_r)
## [1] -0.0003073623
We will first compute the log returns at each time point. Based on these log returns, we can then calculate the realised volatility during any specific time interval. In this lab, we use non-overlapping 30-second intervals as the fixed time intervals for computing realised volatility.
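Concretely, if \(r_1, \ldots, r_N\) are the log returns within an interval, the realised volatility over that interval is
\[
\text{RV} = \sqrt{\sum_{t=1}^{N} r_t^2},
\]
which is exactly what the helper function comp_vol in Question (b) below computes.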
[a] We consider the first 500 time_id
of stock 1 for illustration. Describe what the following lines of
code do. (This chunk may take up to 1 min to run.)
log_r1 <- list()
time_IDs <- unique(stock1[, 1])[1:500]
for (i in 1 : length(time_IDs)) {
sec <- stock1 %>% filter(time_id == time_IDs[i]) %>% pull(seconds_in_bucket)
price <- stock1 %>% filter(time_id == time_IDs[i]) %>% pull(WAP)
log_r <- log(price[-1] / price[1:(length(price) - 1)])
log_r1[[i]] <- data.frame(time = sec[-1], log_return = log_r)
time.no.change <- (1:600)[!(1:600 %in% log_r1[[i]]$time)]
if (length(time.no.change) > 0) {
new.df <- data.frame(time = time.no.change, log_return = 0)
log_r1[[i]] <- rbind(log_r1[[i]], new.df)
log_r1[[i]] <- log_r1[[i]][order(log_r1[[i]]$time), ]
}
}
Solutions Lines 1 and 2: initialise a list to store the log
returns and obtain the first 500 time_id for stock 1.
Line 3: start a for loop over the time IDs. Lines 4 - 7: extract the WAPs,
compute the log returns, and create a data frame storing the log returns and the
corresponding time points. Lines 8 - 13: find the time points at which
the WAP doesn’t change; the log returns are 0 at these time points.
We add these zero log returns to the data frame to facilitate later
analysis.
[b] Divide the 10-min horizon into non-overlapping 30-second intervals. Similar to the above code, use a list to store the 30-second realised volatilities of stock 1 for the different time buckets.
Suggestions
vol <- list()
comp_vol <- function(x) {
return(sqrt(sum(x ^ 2)))
}
for (i in 1 : length(log_r1)) {
log_r1[[i]] <- log_r1[[i]] %>% mutate(time_bucket = ceiling(time / 30))
vol[[i]] <- aggregate(log_return ~ time_bucket, data = log_r1[[i]], FUN = comp_vol)
colnames(vol[[i]]) <- c('time_bucket', 'volatility')
}
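As a quick sanity check (this snippet is our addition, not part of the original lab), each element of vol should be a data frame with 20 rows, one per 30-second bucket of the 10-min interval:
# inspect the realised volatilities for the first time bucket (time_id == 5)
head(vol[[1]])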
[c] Visualise the 10-min time series of log returns
and realised volatility for stock 1 with time_id == 5.
Suggestions
ggplot(data = log_r1[[1]], aes(x = time, y = log_return)) + geom_line()
ggplot(data = vol[[1]], aes(x = time_bucket, y = volatility)) + geom_line() + geom_point()
In this lab, we will examine three potential methods to predict future volatility: regression analysis, the HAV-RV model, and the ARMA-GARCH model. The three methods use different strategies to fit the data. The regression analysis doesn’t treat the realised volatility as a time series but weights observations by recency. The HAV-RV model works directly on the time series of realised volatility. The ARMA-GARCH model works on the time series of log returns and predicts the volatility indirectly.
Before fitting the data, we conduct a preliminary analysis of the autocorrelation function (ACF) of the time series of log returns and realised volatility. Autocorrelation is the correlation between a time series and a lagged version of itself. The ACF starts at a lag of 0, which is the correlation of the time series with itself and therefore always equals 1.
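In symbols, for a stationary series \(X_t\) the ACF at lag \(k\) is
\[
\rho(k) = \text{Corr}(X_t, X_{t-k}),
\]
so \(\rho(0) = 1\) by definition.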
[a] Plot the ACF of the time series of log returns
and of realised volatility, respectively, for stock 1 with
time_id == 5.
Suggestions
acf(log_r1[[1]]$log_return, main = "ACF plot for log returns")
acf(vol[[1]]$volatility, main = "ACF plot for realised volatility")
We now regress the volatility in period \(t\) on the mean price in period \(t-1\), the mean number of orders in period
\(t-1\), and the mean BidAskSpread in
period \(t-1\). We still use stock 1
with time_id == 5 for illustration.
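In symbols (notation ours), the model for each 30-second period \(t\) is
\[
\text{RV}_t = \beta_0 + \beta_1 \overline{\text{WAP}}_{t-1} + \beta_2 \overline{\text{Orders}}_{t-1} + \beta_3 \overline{\text{Spread}}_{t-1} + \varepsilon_t.
\]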
[a] We use the data during the first 8 minutes of
each 10-min time bucket as the training data and use the remaining 2-min
data for validation. Create two lists vol.train and
vol.val that separately store the training and validation
data in vol. The \(i\)th
element of vol.train (vol.val) corresponds to
the training (validation) data for vol[[i]].
Suggestions
vol.train <- list()
vol.val <- list()
for (i in 1 : length(log_r1)) {
vol.train[[i]] <- vol[[i]][1:16, ]
vol.val[[i]] <- vol[[i]][-(1:16), ]
}
[b] Recall that we study the first 500
time_id for stock 1, and each time_id
corresponds to a specific 10-min time bucket. Create a list with length
500, where each element of the list is a data frame for a different
time_id. All the data frames have four columns: the
realised volatility in the training data set and the corresponding mean
price, mean number of orders, and mean BidAskSpread in the previous
period. (We don’t need the first realised volatility in any 10-min time
bucket because there are no covariates for this period.)
Suggestions
list.reg <- list() # list for regression
stock1 <- stock1 %>% mutate(time_bucket = ceiling(seconds_in_bucket / 30),
num_order = bid_size1 + ask_size1 + bid_size2 + ask_size2)
len.train <- length(vol.train[[1]]$volatility)
for (i in 1 : length(vol)) {
stats.bucket <- stock1 %>%
filter(time_id == time_IDs[i] & time_bucket != 0) %>%
select(c(BidAskSpread, WAP, num_order, time_bucket))
# for each 30-sec time bucket, we compute the following statistics
mean.price <- aggregate(WAP ~ time_bucket, data = stats.bucket, FUN = mean)
mean.order <- aggregate(num_order ~ time_bucket, data = stats.bucket, FUN = mean)
mean.BAS <- aggregate(BidAskSpread ~ time_bucket, data = stats.bucket, FUN = mean)
list.reg[[i]] <- data.frame(volatility = vol.train[[i]]$volatility[-1],
price = mean.price$WAP[1:(len.train - 1)],
order = mean.order$num_order[1:(len.train - 1)],
BidAskSpread = mean.BAS$BidAskSpread[1:(len.train - 1)])
}
[c] Use weighted least squares to fit the model when
time_id == 5. Assume we are now at 8 min of a 10-min
bucket, and we weight the observations by recency with weight \(0.8^{8-t}\) (time unit: min). Use a list
with length 500 to store all the models, where each element of the
list is a linear regression model for one of the 500
time_id.
Suggestions
lm.models <- list()
for (i in 1 : length(vol)) {
# the len.train - 1 = 15 training observations sit at t = 1, 1.5, ..., 8 min,
# so the weights 0.8^(8 - t) correspond to 0.8^((14:0)/2) in code
lm.models[[i]] <- lm(volatility ~ price + order + BidAskSpread, list.reg[[i]],
weights = 0.8 ^ (((len.train - 2):0) / 2))
}
# for some periods, linear regression performs well
summary(lm.models[[162]])
##
## Call:
## lm(formula = volatility ~ price + order + BidAskSpread, data = list.reg[[i]],
## weights = 0.8^(((len.train - 2):0)/2))
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -2.608e-04 -5.448e-05 3.796e-05 6.759e-05 1.985e-04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.557e-01 1.274e-01 -3.578 0.00433 **
## price 4.559e-01 1.271e-01 3.588 0.00426 **
## order -1.334e-07 6.337e-07 -0.211 0.83711
## BidAskSpread -1.277e-01 4.033e-01 -0.317 0.75752
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0001356 on 11 degrees of freedom
## Multiple R-squared: 0.7411, Adjusted R-squared: 0.6704
## F-statistic: 10.49 on 3 and 11 DF, p-value: 0.001477
The HAV-RV model is popular for predicting long-term volatility and may not perform well in short-term prediction. Nevertheless, we include it as a benchmark in this lab. Since the realised volatility time series only has 20 time points, we only use the volatility at \(t-1\) and the mean volatility from \(t-5\) to \(t-1\) to fit the volatility at time \(t\).
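In symbols (notation ours), this truncated HAV-RV model is
\[
\text{RV}_t = \beta_0 + \beta_1 \text{RV}_{t-1} + \beta_2 \Big( \frac{1}{5} \sum_{j=1}^{5} \text{RV}_{t-j} \Big) + \varepsilon_t.
\]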
[a] Similar to Question (b) in Section 2.2, create a
list of length 500, where each element of the list is a data frame for a
different time_id. All the data frames have three columns:
the realised volatility, the realised volatility in the
previous period, and the mean realised volatility during the previous
five periods. (The first row starts from the 6th time point of each
10-min time bucket because we need the mean realised volatility over the
past 5 periods.)
Suggestions
list.HAV <- list()
for (i in 1 : length(vol)) {
# rolling mean of the previous five realised volatilities
mean.vol <- rep(0, len.train - 5)
for (j in 1 : 5) {
mean.vol <- mean.vol + vol.train[[i]]$volatility[j : (j + len.train - 6)] / 5
}
list.HAV[[i]] <- data.frame(vol = vol.train[[i]]$volatility[-(1:5)],
vol_1 = vol.train[[i]]$volatility[5:(len.train - 1)],
mean_vol_5 = mean.vol)
}
[b] Fit the HAV model by using ordinary least
squares (OLS) and weighted least squares (WLS). For WLS, we use \(w_t=\text{RV}_{t-1}/\sqrt{\text{RQ}_{t-1}}\)
as the weight for each time period \(t\). Similar to Question (c) in Section
2.2, use two lists named HAV.ols.models and
HAV.wls.models (each with length 500) to store all the
models for every 10-min time bucket.
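Here \(\text{RQ}\) denotes the realised quarticity; with \(N\) log returns \(r_1, \ldots, r_N\) in a period, it is estimated as
\[
\text{RQ} = \frac{N}{3} \sum_{t=1}^{N} r_t^4,
\]
which matches the helper function comp_quar in the suggested code below.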
Suggestions
quar <- list()
comp_quar <- function(x) {
return(length(x) / 3 * sum(x ^ 4))
}
for (i in 1 : length(log_r1)) {
quar[[i]] <- aggregate(log_return ~ time_bucket, data = log_r1[[i]], FUN = comp_quar)
colnames(quar[[i]]) <- c('time_bucket', 'quarticity')
}
HAV.ols.models <- list()
HAV.wls.models <- list()
for (i in 1 : length(vol)) {
HAV.ols.models[[i]] <- lm(vol ~ vol_1 + mean_vol_5, list.HAV[[i]])
HAV.wls.models[[i]] <- lm(vol ~ vol_1 + mean_vol_5, list.HAV[[i]],
weights = list.HAV[[i]]$vol_1 /
sqrt(quar[[i]]$quarticity[5:(len.train - 1)]))
}
# HAV-RV performs well for some time buckets
summary(HAV.wls.models[[218]])
##
## Call:
## lm(formula = vol ~ vol_1 + mean_vol_5, data = list.HAV[[i]],
## weights = list.HAV[[i]]$vol_1/sqrt(quar[[i]]$quarticity[5:(len.train -
## 1)]))
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -0.0056587 -0.0015535 0.0000333 0.0015861 0.0050228
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.002436 0.000420 5.801 0.000405 ***
## vol_1 -0.188253 0.231292 -0.814 0.439242
## mean_vol_5 -4.691461 1.180870 -3.973 0.004102 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.003545 on 8 degrees of freedom
## Multiple R-squared: 0.856, Adjusted R-squared: 0.82
## F-statistic: 23.77 on 2 and 8 DF, p-value: 0.0004302
The ARMA-GARCH model works directly on the time series of log returns.
[a] Use rugarch to fit the log return
time series from 0 to 480 seconds (the first 80% of the data) for each
time_id. Specifically, we use \(\text{ARMA}(1, 1)\) and \(\text{GARCH}(1, 1)\) for the conditional mean
and variance, respectively, and adopt a normal distribution for the noise.
Like Questions (c) in Section 2.2 and (b) in Section 2.3, use a list
with length 500 to store all 500 models (corresponding to the 500
time_id).
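In symbols (standard ARMA(1, 1)-GARCH(1, 1) notation; rugarch includes a mean intercept by default), the model for the log returns \(r_t\) is
\[
r_t = \mu + \phi r_{t-1} + \theta \varepsilon_{t-1} + \varepsilon_t, \qquad
\varepsilon_t = \sigma_t z_t, \qquad
\sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2,
\]
where the \(z_t\) are i.i.d. standard normal noises.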
Hint: rugarch has a function ugarchfit to
fit an ARMA-GARCH model. Several solvers are available for fitting the
data; one option is to use the hybrid solver for this
function.
Suggestions
spec <- ugarchspec(variance.model = list(model = "sGARCH", garchOrder = c(1, 1)),
mean.model = list(armaOrder = c(1, 1)),
distribution.model = "norm")
ARMA_GARCH.models <- list()
for (i in 1 : length(vol)) {
ARMA_GARCH.models[[i]] <- ugarchfit(spec = spec, data = log_r1[[i]] %>%
filter(time <= 480) %>% pull(log_return),
solver = 'hybrid')
}
[b] For each of the 500 time_id of
stock 1, simulate 1000 paths of log returns in the next 30 seconds and
compute the corresponding realised volatility. Average them to get an
estimate of the realised volatility in the next period.
Hint: Type ? ugarchpath to see how to simulate a path
based on a fitted ARMA-GARCH model.
Suggestions
RV.pred <- rep(0, length(vol))
for (i in 1 : length(vol)) {
fspec <- getspec(ARMA_GARCH.models[[i]])
setfixed(fspec) <- as.list(coef(ARMA_GARCH.models[[i]]))
future.path <- fitted(ugarchpath(fspec, n.sim = 30, m.sim = 1000))
# Due to numerical issues, NA values can sometimes be produced
# We simply replace NA values with 0; you may come up with a better idea in your own project
future.path[is.na(future.path)] <- 0
# each column of future.path is one simulated 30-second path;
# sqrt(colSums(future.path ^ 2)) gives the realised volatility of each path
RV.pred[i] <- mean(sqrt(colSums(future.path ^ 2)))
}
We will compute two performance measures, QLIKE and MSE, from the
prediction results under the one-shot estimation scheme for prediction.
For illustration, this lab only computes the prediction performance
measures for the first method in Section 2, i.e., the regression method.
Recall that we stored all the fitted regression models in a list
lm.models in Section 2.2.
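For actual volatilities \(\text{RV}_t\) and predictions \(\widehat{\text{RV}}_t\) over \(n\) validation periods, the two measures are
\[
\text{MSE} = \frac{1}{n} \sum_{t=1}^{n} \big( \text{RV}_t - \widehat{\text{RV}}_t \big)^2, \qquad
\text{QLIKE} = \frac{1}{n} \sum_{t=1}^{n} \left( \frac{\text{RV}_t}{\widehat{\text{RV}}_t} - \log \frac{\text{RV}_t}{\widehat{\text{RV}}_t} - 1 \right),
\]
matching the computations in the suggested code below.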
[a] Use the fitted linear regression models to predict the realised volatility in the validation data set.
Suggestions
list.reg.val <- list()
len.val <- length(vol.val[[1]]$volatility)
pred.lm <- list()
for (i in 1 : length(vol)) {
stats.bucket <- stock1 %>%
filter(time_id == time_IDs[i] & time_bucket != 0) %>%
select(c(BidAskSpread, WAP, num_order, time_bucket))
mean.price <- aggregate(WAP ~ time_bucket, data = stats.bucket, FUN = mean)
mean.order <- aggregate(num_order ~ time_bucket, data = stats.bucket, FUN = mean)
mean.BAS <- aggregate(BidAskSpread ~ time_bucket, data = stats.bucket, FUN = mean)
list.reg.val[[i]] <-
data.frame(volatility = vol.val[[i]]$volatility,
price = mean.price$WAP[len.train:(len.train + len.val - 1)],
order = mean.order$num_order[len.train:(len.train + len.val - 1)],
BidAskSpread = mean.BAS$BidAskSpread[len.train:(len.train + len.val - 1)])
pred.lm[[i]] <- predict(lm.models[[i]], newdata = list.reg.val[[i]])
}
[b] Compute the QLIKE and MSE for all data points in the validation data set. Aggregate the results and plot them as box plots.
Suggestions
MSE.lm <- vector()
QLIKE.lm <- vector()
for (i in 1 : length(vol)) {
MSE.lm <- c(MSE.lm, mean((vol.val[[i]]$volatility - pred.lm[[i]]) ^ 2))
QLIKE.lm <- c(QLIKE.lm, mean(vol.val[[i]]$volatility / pred.lm[[i]] -
log(vol.val[[i]]$volatility / pred.lm[[i]]) - 1))
}
boxplot(MSE.lm, horizontal = TRUE)
boxplot(QLIKE.lm, horizontal = TRUE)